Visual Mining of Powersets with Large Alphabets

Authors

  • Tamara Munzner
  • Qiang Kong
  • Raymond T. Ng
  • Jordan Lee
  • Janek Klawe
  • Dragana Radulovic
  • Carson K. Leung
Abstract

We present the PowerSetViewer visualization system for the lattice-based mining of powersets. Searching for items within the powerset of a universe occurs in many large dataset knowledge discovery contexts. Using a spatial layout based on a powerset provides a unified visual framework at three different levels: data mining on the filtered dataset, browsing the entire dataset, and comparing multiple datasets sharing the same alphabet. The features of our system allow users to find appropriate parameter settings for data mining algorithms through lightweight visual experimentation showing partial results. We use dynamic constrained frequent set mining as a concrete case study to showcase the utility of the system. The key challenge for spatial layouts based on powerset structure is handling large alphabets, because the size of the powerset grows exponentially with the size of the alphabet. We present scalable algorithms for enumerating and displaying datasets containing between 1.5 and 7 million itemsets, and alphabet sizes of over 40,000.
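To make the scaling problem concrete: an alphabet of n items has 2^n subsets, so a layout keyed on powerset position cannot materialise every cell once n reaches the tens of thousands. The sketch below is only an illustration, not the PowerSetViewer algorithm. It assumes one possible enumeration order (by cardinality, then lexicographically within each cardinality) and shows that an itemset's position in that order can be computed directly from binomial coefficients, without enumerating the powerset. The names `powerset_by_size` and `rank` are hypothetical.

```python
from itertools import combinations
from math import comb


def powerset_by_size(n):
    """Yield every subset of range(n): smallest cardinality first,
    lexicographic order within each cardinality (assumed ordering)."""
    for k in range(n + 1):
        yield from combinations(range(n), k)


def rank(itemset, n):
    """Return the 0-based position of `itemset` in the order produced by
    powerset_by_size(n), computed arithmetically instead of by enumeration.
    Items are integers in range(n)."""
    items = sorted(itemset)
    k = len(items)
    # Skip all subsets with fewer than k items.
    position = sum(comb(n, j) for j in range(k))
    # Add the lexicographic rank among the k-item subsets.
    prev = -1
    for i, c in enumerate(items):
        # Every smaller choice v for slot i precedes this itemset, together
        # with all ways to fill the remaining k-1-i slots from items above v.
        for v in range(prev + 1, c):
            position += comb(n - 1 - v, k - 1 - i)
        prev = c
    return position


if __name__ == "__main__":
    # Brute-force check on a small alphabet.
    n = 5
    assert all(rank(s, n) == i for i, s in enumerate(powerset_by_size(n)))
    # Direct lookup for a large alphabet, where enumeration is infeasible.
    print(rank((0, 1), 40_000))
```

Because `rank` touches only the items actually present in the itemset, the cost depends on the alphabet size and the itemset length rather than on the 2^n size of the powerset. Python's arbitrary-precision integers keep the positions exact even when they exceed 64 bits.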

Related resources

Relationship of algebraic theories to powersets over objects in Set and Set × C

This paper deals with a particular question: When do powersets in lattice-valued mathematics form algebraic theories (or monads) in clone form? Our approach in this and related papers is to consider “powersets over objects” in the ground categories Set and Set × C from the standpoint of algebraic theories in clone form (C is a particular subcategory of the dual of the category of semi-quantales)...

Frequent Contiguous Pattern Mining Algorithms for Biological Data Sequences

Transaction sequences in market-basket analysis have a large alphabet and short length, whereas bio-sequences have a small alphabet and long length with gaps. The pattern-finding algorithms for these two kinds of sequences therefore differ. Small repeated patterns are more likely to occur in bio-sequences than in transaction sequences. These repeatedly occurring small patt...

A Top-Down Method for Mining Most Specific Frequent Patterns in Biological Sequence Data

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase of the amount of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences which subsume a l...

A Top-down Approach for Mining Most Specific Frequent Patterns in Biological Sequence Data

The emergence of automated high-throughput sequencing technologies has resulted in a huge increase of the amount of DNA and protein sequences available in public databases. A promising approach for mining such biological sequence data is mining frequent subsequences. One way to limit the number of patterns discovered is to determine only the most specific frequent subsequences which subsume a l...

Learning Regular Languages over Large Alphabets

Learning regular languages is a branch of machine learning, which has been proved useful in many areas, including artificial intelligence, neural networks, data mining, verification, etc. On the other hand, interest in languages defined over large and infinite alphabets has increased in recent years. Although many theories and properties generalize well from the finite case, learning such langu...

Publication date: 2005